THE OFFICE - TEXT ANALYSIS
This project’s main purpose is to analyze a TV show in a reliable and measurable way, without the need to watch the whole show or rely on a personal perspective. The subject of this analysis is the sitcom ‘The Office’, chosen mainly for the high availability of data.
This notebook uses the previously collected and cleaned data to carry out the analysis.
The non-standard libraries used in this notebook are:
Pandas
for data wrangling;
NumPy, scikit-learn, SciPy, and spaCy
for mathematical, statistical, and machine-learning related tasks;
NetworkX, Matplotlib, and Seaborn
for visualizations.
The first step in analyzing the show is to define who the main characters are.
This may have several different interpretations, but for this project we are considering a few metrics to do so. The points to be considered are:
The numbers of dialogs and episodes are the main indicators of who the main characters are: the characters with the largest share of dialog who appear in the most episodes receive the most attention, and therefore should be the main characters.
There's a challenge here because of special guests and characters who were very important for a short time. Those characters appear to have many dialogs and participate in many episodes, but they are only around for a couple of seasons at most; this is why we take the number of seasons into account when calculating the main-character score.
To solve this issue, I developed a score that combines all the above-mentioned approaches.
We start by aggregating the numerical fields and getting their respective descriptive statistics, such as means, standard deviations, medians, and other aggregations.
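As a sketch of that aggregation step, assuming a per-dialog table like the one produced by the cleaning notebook (the rows and column names below are made up for illustration):

```python
import pandas as pd

# Hypothetical per-dialog data standing in for the cleaned dataset.
dialogs = pd.DataFrame({
    "character": ["Michael", "Michael", "Jim", "Pam", "Jim"],
    "season": [1, 2, 1, 1, 2],
    "episode": [1, 3, 1, 2, 3],
    "n_words": [25, 40, 10, 15, 30],
})

# Aggregate a numerical field per character and compute descriptive stats.
stats = dialogs.groupby("character")["n_words"].agg(
    ["count", "mean", "std", "median"]
)
print(stats)
```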
This is the indicator I developed to help to find the main characters of the series and classify them by relevance to the show.
Score = nep + (nd / nep) * (ns / 5)
nep = number of episodes;
nd = number of dialogs;
ns = number of seasons;
5 is a threshold I used.
The idea is to "penalize" characters who appeared in fewer than 5 seasons (approximately half the series) and give more weight to characters who appeared in more than 5 seasons.
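Given the definitions above, the score can be written directly; the character totals below are invented, purely to show how a long-running lead outranks a short-lived guest:

```python
def main_character_score(nep, nd, ns, threshold=5):
    """Score = nep + (nd / nep) * (ns / threshold).

    nep: number of episodes; nd: number of dialogs; ns: number of seasons.
    Characters below `threshold` seasons have their dialogs-per-episode
    term scaled down; characters above it get a boost.
    """
    return nep + (nd / nep) * (ns / threshold)

# Hypothetical numbers, for illustration only.
print(main_character_score(nep=180, nd=10000, ns=9))  # → 280.0
print(main_character_score(nep=20, nd=2000, ns=1))    # → 40.0
```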
Main characters according to the score are:
To test our score we can compare the distributions for our selected variables.
In the chart below the values are displayed as:
The chart compares the main characters (red) selected by the score with all the other characters (blue).
We can also see how the score handles the 'Average dialogs'; in the chart below we have:
A very interesting characteristic to analyze is the number of words and sentences a character says. Characters with high averages here have a lot to talk about: they don't just react to situations, they have something to add to them.
Roughly, subjectivity here is the gap between how many words are said and how much is actually being said: using many words to convey a small message indicates high subjectivity.
In the chart displayed below, the blue bars represent the means and the black lines the standard deviations. The problem in this case is the huge difference between the means and the standard deviations: our data have extreme outliers, so the averages are not a good indication of who talks more or less; they only give us a rough idea.
Since we can't get the full picture from the means, we can analyze the medians for those characters.
The medians are more interesting: one character in particular stands out. Kevin uses, by median, the lowest number of words per dialog of all the main characters.
And we can easily find evidence of that.
https://www.youtube.com/watch?v=_K-L9uhsBLM
To analyze the overall sentiment polarity of the show and its main characters, we're using VADER; please consult the data cleaning and preparation notebook for more information about this method and its implementation.
The visualization of the results aims at displaying the characters of the show, and the average positive and negative sentiments for each of them.
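Assuming per-dialog VADER scores already exist (they come from the cleaning notebook; the numbers below are invented stand-ins), the per-character averages plotted on the chart reduce to:

```python
from statistics import mean

# Hypothetical VADER outputs per dialog: each dict is one dialog's polarity.
dialog_scores = {
    "Michael": [{"pos": 0.20, "neg": 0.05}, {"pos": 0.10, "neg": 0.07}],
    "Stanley": [{"pos": 0.06, "neg": 0.08}, {"pos": 0.04, "neg": 0.06}],
}

# Average positive and negative sentiment per character:
# these pairs are the (x, y) coordinates of the chart.
averages = {
    name: (mean(s["pos"] for s in scores), mean(s["neg"] for s in scores))
    for name, scores in dialog_scores.items()
}
print(averages)
```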
For a fair perspective of those values, we compare them on the same scale:
The range of 0.04 to 0.23 is applied to both the x and y axes.
We can see that most characters behave similarly in terms of polarity in their dialogs: the values concentrate at high positive and low negative for the vast majority of them, but we can also see some outliers away from the group.
As mentioned before, most of the characters have a high positive score of around 0.14 to 0.20, with a low negative score of 0.06 to 0.08.
But we can note some characters with higher negative scores and also a character with a lower positive score.
Stanley is the most distant from the other characters: he has a relatively low positive score, but his negative score isn't particularly high either.
This means his dialogs are mostly neutral, almost as if he doesn't want to get involved. https://www.youtube.com/watch?v=iahcJPo9Dwg
The file 'conversations.json' contains one record for every scene in the show, where each record contains the names of the characters that had some dialog in the scene and the respective number of dialogs each character had.
These conversations will be used to calculate a score for the relations between the characters.
In order to compare the relationships between the characters, the following formula was developed:
Score(x, y) = min(nx, ny) / max(nx, ny)
Where:
nx = number of dialogs character x had in a conversation;
ny = number of dialogs character y had in a conversation.
This score is based on the concept that a perfectly balanced conversation will have the same amount of dialogs between both agents.
E.g.: in a conversation with three characters x, y, and z, where x said 5 dialogs, y said 5 dialogs, and z said 1 dialog, the score between x and y will be 1, while the score between x and z will be 0.2.
The scores are then aggregated across all scores for the same relation so they can be compared. It's important to note that this results in generally higher scores for characters who communicate a lot and lower scores for those who don't.
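Consistent with the worked example above (5/5 → 1, 1/5 → 0.2), the per-pair score and its aggregation can be sketched like this, using one invented scene record in the shape described for 'conversations.json':

```python
from collections import defaultdict
from itertools import combinations

def balance_score(nx, ny):
    # A perfectly balanced conversation (nx == ny) scores 1;
    # lopsided ones score closer to 0.
    return min(nx, ny) / max(nx, ny)

# One hypothetical scene: character -> number of dialogs in the scene.
scene = {"x": 5, "y": 5, "z": 1}

# Aggregate every pairwise score under its relation, across scenes.
totals = defaultdict(list)
for (a, na), (b, nb) in combinations(scene.items(), 2):
    totals[(a, b)].append(balance_score(na, nb))

print(totals[("x", "y")])  # → [1.0]
print(totals[("x", "z")])  # → [0.2]
```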
After calculating the relationship scores for every character of the show we have the following data:
At this point we'll start comparing the relationships and describing them as 'strong' or 'weak', depending on the value of their scores. It's important to note that a strong relationship in this context doesn't relate to the sentiment involved between the characters, so it won't necessarily be a positive relation.
In this context, a strong relationship means the characters communicate a lot.
By themselves, the scores are already very meaningful: we can tell that Pam and Jim have the strongest relationship of all.
We can also notice that Michael, the main character of the show, has an overall higher score with everybody when compared to 'lower-ranked' main characters such as Meredith, Creed, or Darryl.
This makes sense from the perspective that Michael has been communicating more constantly with everybody in the show, so he probably has a stronger relationship with most characters.
To extract even more information about the relationships, we can normalize the scores; in this case we'll do so by standardizing the values, i.e., calculating their z-scores. This will allow us to see how many standard deviations away from the mean each relation is.
Simplifying, we want to see how extreme those relationships are for each character.
One way of improving this visualization is by showing the actual p-values; they represent how likely it would be to find values at least that extreme in the distribution.
In this case, we'll look for relationships with a p-value lower than 0.05, i.e., a 95% confidence level that those relationships differ in a statistically significant way from the average relationships of the analyzed characters.
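The standardize-then-test step can be sketched with SciPy; the score vector below is invented, and the p-values assume a normal distribution:

```python
import numpy as np
from scipy import stats

# Hypothetical aggregated relationship scores for one character.
scores = np.array([2.0, 3.0, 2.5, 3.5, 9.0, 2.8, 3.2])

# Standardize: how many standard deviations from the mean each score is.
z = stats.zscore(scores)

# Two-sided p-value under a normal assumption: how likely a value at
# least this extreme would be.
p = 2 * stats.norm.sf(np.abs(z))

significant = p < 0.05  # 95% confidence level
print(list(zip(z.round(2), p.round(3), significant)))
```

Only the 9.0 outlier clears the 0.05 threshold in this toy vector; the ordinary-looking scores do not.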
With 95% confidence, the relationships listed below had a higher conversation score than the average relationship.
Michael -> Dwight
Dwight -> Michael
Dwight -> Jim
Jim -> Dwight
Jim -> Pam
Pam -> Jim
Angela -> Dwight
Andy -> Dwight
Darryl -> Andy
Ryan -> Michael
Stanley -> Phyllis
Visualize the strongest relationships in a network chart
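A minimal sketch of that chart's structure with NetworkX, taking the statistically significant pairs above as undirected edges (duplicated directions collapse into one edge); the actual rendering would be a single `nx.draw_networkx` call:

```python
import networkx as nx

# The statistically strong relationships listed above, as edges.
strong = [
    ("Michael", "Dwight"), ("Dwight", "Jim"), ("Jim", "Pam"),
    ("Angela", "Dwight"), ("Andy", "Dwight"), ("Darryl", "Andy"),
    ("Ryan", "Michael"), ("Stanley", "Phyllis"),
]

G = nx.Graph()
G.add_edges_from(strong)

# nx.draw_networkx(G, with_labels=True) would render the chart;
# here we just inspect the structure: Dwight is the best-connected node.
print(sorted(G.degree, key=lambda d: -d[1])[:3])
```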
Word and term frequencies can give us an interesting perspective on how the characters communicate and what the show is about.
To start, we can visualize the show's most frequent words in a word cloud; to do that, we're using a bag-of-words algorithm that selects and displays the words and terms with the highest frequency.
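At its core, the bag-of-words step is just tokenize-and-count; a minimal sketch, with a few invented dialog lines standing in for the transcript:

```python
import re
from collections import Counter

# Hypothetical dialog lines standing in for the show's transcript.
dialogs = [
    "That's what she said.",
    "Bears. Beets. Battlestar Galactica.",
    "She said what? What did she say?",
]

# A minimal bag-of-words: lowercase, tokenize, count.
tokens = re.findall(r"[a-z']+", " ".join(dialogs).lower())
freq = Counter(tokens)

# The word cloud scales each word's font size by these counts.
print(freq.most_common(3))
```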
We can see in the above visualization that many of the words relate to people: names and pronouns are very common in their daily communication. We can also see that many of those words have little to no meaning by themselves.
To improve on that we can check what are the distinguishable terms spoken by the characters, in other words, we'll remove words that are common to all characters and focus on the words that are specific to each of the main characters.
Term Frequency - Inverse Document Frequency (TF-IDF) is a method that compares how many times a term appears in a document with how many documents the term appears in.
TF-IDF( t, d ) = Term Frequency( t, d ) * Inverse Document Frequency( t )
Term = t
Document = d
We take the difference between the mean score across all characters and each character's score; this shows how far above or below average each word is for that character.
The result is then sorted to get each character's most above-average words.
The 10 most distinct words by character
One of the many ways of breaking down all this data is by analyzing the characters individually; from this point on, the previously discussed methods will be adapted for a single character.
Besides the previously seen data, in this section we'll also explore the ratings.
The polarity scores for each dialog were generated by VADER, please consult the data cleaning and preparation notebook for more information about this method and its implementation.
The sentiment analysis displays high amounts of Neutral interactions and low amounts of negative and positive for most characters. To better visualize the small differences between those scores we can normalize them.
To visualize the three normalized variables (positive, negative, and neutral), we'll be using radar charts, with the normalized data we can more easily compare the extents of each polarity in the selected character.
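The exact normalization isn't restated here; one plausible choice (an assumption, the notebook may use another) is min-max scaling each polarity across characters, so each radar axis spans 0 to 1:

```python
# Hypothetical mean polarity scores per character.
polarity = {
    "Michael": {"pos": 0.16, "neg": 0.06, "neu": 0.78},
    "Stanley": {"pos": 0.08, "neg": 0.07, "neu": 0.85},
    "Kelly":   {"pos": 0.20, "neg": 0.05, "neu": 0.75},
}

def min_max(values):
    # Scale a list of values to the [0, 1] range.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

names = list(polarity)
normalized = {}
for axis in ("pos", "neg", "neu"):
    scaled = min_max([polarity[n][axis] for n in names])
    for n, v in zip(names, scaled):
        normalized.setdefault(n, {})[axis] = v

print(normalized["Kelly"])  # highest pos of the three -> 1.0 on that axis
```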
We can also visualize the distribution of the polarity through the episodes; this should allow us to see changes in the character's behavior, as well as outliers that may be worth a closer look.
In this section, we'll repeat the methods used in '5 - Words Frequency', but this time with a single character, and we'll also add a method from spaCy, that can help us identify the entities mentioned in the dialogs.
Here we can analyze the most distinguishable terms for a specific character; the font sizes are adjusted so that the more distinguishable the term, the bigger the font.
Here we're building a word cloud with the most frequent terms the character said; the cleaned version of the text is used for the visualization.
Here we'll visualize the most commonly mentioned entities. Specifically, we'll filter the people, organizations, products, locations, and events mentioned in the dialogs, and then count them to visualize the entities most mentioned in the show by the selected character.
Regarding Michael, we can see something the word and term frequencies have in common: they're all strongly related to people.
In the TF-IDF scores, Michael's top 10 most distinguishable words include 2 pronouns (Everybody and Somebody) and 5 names. In the bag-of-words output it's harder to see patterns, since there are many meaningless words, but we can still see lots of names and pronouns related to people.
The strongest evidence of this is the most frequent entities mentioned by Michael: of the 15 words displayed, only one is not a person, and that exception is actually the name of their city. This suggests that Michael is someone whose biggest interests are in people and the community.
https://www.youtube.com/watch?v=vrPgsrfZWOU&feature=youtu.be&t=327
Here we can verify the correlation (Pearson method) between the previously analyzed measures and the actual episode ratings.
We can also compare any given variable with the actual ratings; this helps us visualize how closely related those values are.
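The Pearson check reduces to one SciPy call; the per-episode values below are invented to show the shape of the computation, not actual results:

```python
from scipy.stats import pearsonr

# Hypothetical per-episode values: a measure (e.g. average positive
# polarity) and the corresponding episode ratings.
measure = [0.12, 0.15, 0.11, 0.18, 0.14, 0.16]
ratings = [8.1, 8.5, 8.0, 8.9, 8.3, 8.6]

# r is the correlation coefficient; the p-value tests whether the
# observed correlation could plausibly be zero.
r, p_value = pearsonr(measure, ratings)
print(round(r, 3), round(p_value, 4))
```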